Simple RAG Demo

Overall Flow

  • This project is a small RAG system implementation: it processes a file from a local path and lets you ask a question about the document.
  • Here is what the output of the program looks like.
~/ » uv run main.py                    
INFO:__main__:Started parsing document
INFO:__main__:Finished parsing, found 23577 characters
INFO:__main__:Finished chunking, created 169 chunks
Batches: 100%|███████████████████████████████████████████████████████| 4/4 [00:00<00:00, 12.10it/s]
Batches: 100%|███████████████████████████████████████████████████████| 3/3 [00:00<00:00, 15.32it/s]
INFO:__main__:Processed and stored embeddings

Please enter a question to ask: what is retrival augumented generation?

INFO:__main__:Performing semantic search
Batches: 100%|███████████████████████████████████████████████████████| 1/1 [00:00<00:00, 8.39it/s]

Retrieval-Augmented Generation (RAG) is a framework that combines the knowledge of a generative language model with an external retriever to provide accurate and relevant responses to user queries. RAG works by providing the retriever and generator work together, where the retriever retrieves relevant information from a pre-defined corpus, while the generator generates new text based on the retrieved information. This process is repeated multiple times until the generated text matches the desired output. The final output of RAG is both accurate and relevant to the user's query, ensuring that the generated content is not only informative but also engaging and useful for the user.
def main() -> None:
    file_path = "Introduction to Retrieval Augmented Generation (RAG) By Weaviate.pdf"

    # Document parsing
    logger.info("Started parsing document")
    page_content = parse_document(path=file_path)
    logger.info(f"Finished parsing, found {len(page_content)} characters")

    # Text chunking
    chunks = chunk_text(page_content, 200, 60)
    logger.info(f"Finished chunking, created {len(chunks)} chunks")

    # Embedding and storage
    collection = get_embedding_collection()
    store_chunk_embeddings(chunks, collection)
    logger.info("Processed and stored embeddings")

    # Get user query
    query = prompt("Please enter a question to ask")
    if not query:
        raise ValueError("Query cannot be empty")

    # Semantic search
    logger.info("Performing semantic search")
    similar_results = semantic_search(query, collection)

    # Process results
    # info: We ignore results with a low score (confidence threshold).
    if not similar_results or similar_results[0][0] < 0.6:
        logger.warning("Not enough context found")
        print("Not enough context found, please try another question")
    else:
        content_results = [content for _, content in similar_results]
        final_answer = model_run(query, content_results)
        print(final_answer)


if __name__ == "__main__":
    main()
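
The confidence gate in main() can be exercised on its own with mock search results. This is a minimal sketch: passes_threshold is a hypothetical helper that mirrors the check above, and the sample scores and snippets are made up.

```python
def passes_threshold(similar_results, threshold=0.6):
    # Mirror the check in main(): there must be at least one result,
    # and the top result's score must clear the confidence threshold.
    return bool(similar_results) and similar_results[0][0] >= threshold

good = [[0.82, "RAG combines retrieval with generation."]]
weak = [[0.41, "Some unrelated passage."]]

print(passes_threshold(good))  # True
print(passes_threshold(weak))  # False
print(passes_threshold([]))    # False
```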

Extract Text from PDF

  • Using the pymupdf library, we can extract text page by page and store it in a list.
def parse_document(path: str) -> str:
    doc = pymupdf.open(path)

    page_content = []

    for page in doc:  # iterate over the document pages
        text = page.get_text()  # get plain text encoded as UTF-8
        page_content.append(text)

    return "".join(page_content)

Break the Text Contents

  • A custom chunking function: based on the options passed, it creates chunks that overlap by a sliding window.
def chunk_text(text, size, overlap):
    chunks = []

    for i in range(0, len(text), size - overlap):
        chunk = text[i : i + size]
        chunks.append(chunk)

    return chunks
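
To see the overlap window in action, the function can be run on a short string. The sizes here are illustrative, not the 200/60 used in main(); the function is repeated so the snippet is self-contained.

```python
def chunk_text(text, size, overlap):
    chunks = []
    # Advance by (size - overlap) characters each step, so consecutive
    # chunks share `overlap` characters of context.
    for i in range(0, len(text), size - overlap):
        chunks.append(text[i : i + size])
    return chunks

print(chunk_text("abcdefghij", 4, 2))
# ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

Note the final chunk may be shorter than `size`, since slicing past the end of the string is safe in Python.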

Embedding

  • Before we store our embeddings, we first need to configure the store and the embedding model we want to use.
  • The collection we use is in-memory, so it is erased when the Python program exits.
def get_embedding_collection():
    embedding_model = llm.get_embedding_model("sentence-transformers/all-MiniLM-L6-v2")
    collection = llm.Collection(name="entries", model=embedding_model)
    return collection
  • We can now embed our chunks into the collection we created; llm has an option to store multiple chunks at once.
def store_chunk_embeddings(chunks, collection):
    collection.embed_multi(
        entries=((i, chunk) for i, chunk in enumerate(chunks)), store=True
    )
  • The llm library checks every row and calculates a cosine similarity score; we pass 3 because we only want the top 3 results.
def semantic_search(query, collection):
    similar_data = []
    for entry in collection.similar(query, 3):
        similar_data.append([entry.score, entry.content])
    return similar_data
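
Conceptually, the cosine similarity score used to rank entries is the dot product of two embedding vectors divided by the product of their magnitudes. This is a pure-Python sketch of the math, not the llm library's actual implementation:

```python
import math

def cosine_similarity(a, b):
    # Dot product over the product of the two vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0  (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0  (orthogonal)
```

A score of 1.0 means the vectors point in the same direction (semantically closest), while scores near 0 indicate unrelated content, which is why the 0.6 threshold in main() works as a confidence cutoff.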

Pass To LLM

  • We use the very small orca-mini:3b model (3B parameters, about 2 GB file size); the system prompt instructs the model to consider only the passed context.
def model_run(query, results):
    model = llm.get_model("orca-mini-3b-gguf2-q4_0")
    context = "\n".join(results)

    response = model.prompt(
        f"User query: {query} and the following Context: {context}",
        key="sk-...",
        system="You are an AI assistant that provides answers based solely on the given context and user query. Please ensure your responses are clear, concise, and directly address the user query, including only relevant information.",
    )
    return response.text()
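
The prompt handed to the model is just the user query concatenated with the newline-joined context, using the same format string as above. The query and context snippets in this sketch are made up:

```python
query = "what is retrieval augmented generation?"
results = [
    "RAG pairs a retriever with a generator.",
    "The retriever pulls relevant passages from a corpus.",
]

# Same assembly as model_run(): join the retrieved chunks, then
# interpolate the query and context into a single prompt string.
context = "\n".join(results)
prompt_text = f"User query: {query} and the following Context: {context}"
print(prompt_text)
```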